Module 01
“The simple graph has brought more information to the data analyst’s mind than any other device.”
—John Tukey
Exploratory Data Analysis by John Tukey (Tukey 1977) is now considered a classic in the field of data analysis and statistics.
Four chapters are devoted to Graphic Presentation in my copy of Applied General Statistics (Croxton and Cowden 1946). (The book was first published in 1939.)
R for Data Science is an introduction to data manipulation and visualization. The authors are proponents of the tidyverse and ggplot2. The tidyverse is a collection of R packages designed for data science, in contrast to base R, the functions and packages that ship with R itself.
The tidyverse provides an integrated framework that allows beginners to quickly get up to speed with data manipulation.
ggplot2 is a plotting system for R, based on the grammar of graphics. Once you become familiar with ggplot2, you will see its presence in many publications. A Layered Grammar of Graphics (Wickham 2010) provides the philosophical framework for ggplot2.
Before you begin any readings, you should have R and RStudio installed on your computer.
Follow the instructions on the Posit.co website for installing the RStudio IDE (integrated development environment).
Once you have R and RStudio installed, start RStudio and type library(tidyverse) in the console.
You’ll see the following message the first time you load the package.
The Palmer Penguins dataset is a popular dataset for learning data visualization. It is bundled with the palmerpenguins package. The dataset was created by Allison Horst, Alison Hill, and Kristen Gorman. The dataset is available on GitHub.
Data frames will be the default data structure we use in this course. Data frames should look familiar to anyone who has used spreadsheets.
Variables are in columns and observations are in rows.
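To see this layout in practice, here is a short sketch that inspects the penguins data frame. It assumes the palmerpenguins package is already installed (for example with install.packages("palmerpenguins")).

```{r}
library(palmerpenguins)

# Each row is one penguin (an observation); each column is a variable
dim(penguins)     # number of rows and columns
names(penguins)   # variable names
head(penguins)    # first six observations
str(penguins)     # type of each variable
```

The output of str() is a quick way to tell numeric variables from factors before choosing a plot type.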
```{r}
library(palmerpenguins)  # provides the penguins data frame
library(ggthemes)        # provides scale_color_colorblind()

ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species)) +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  scale_color_colorblind()
```
Warning: Removed 2 rows containing missing values or values outside the scale range (geom_point()).
Important
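When aesthetic mappings are defined in the ggplot() function, they are inherited by all layers.
A “color” aesthetic defined there is applied to both the geom_point() and geom_smooth() layers, whereas aesthetics mapped inside a single geom, such as geom_point(mapping = aes(color = species, shape = species)), apply only to that layer.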
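A minimal sketch of this inheritance, assuming ggplot2 and palmerpenguins are installed. Color is mapped globally in ggplot(), so geom_smooth() inherits it and fits one line per species. The object name p_global is only illustrative.

```{r}
library(ggplot2)
library(palmerpenguins)

# color is mapped in ggplot(), so every layer inherits it
p_global <- ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm")

p_global
```

Moving color out of ggplot() and into geom_point() leaves geom_smooth() with a single group, so only one line is fitted across all species.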
```{r}
ggplot(
  data = penguins,
  mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
  geom_point(mapping = aes(color = species, shape = species)) +
  geom_smooth(method = "lm") +
  labs(
    title = "Body mass and flipper length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species",
    shape = "Species"
  ) +
  scale_color_colorblind()
```
Create a new Quarto html document and answer questions 1 through 10 in the R for Data Science section 1.2.5 Exercises.
The NIST/SEMATECH e-Handbook of Statistical Methods is a collaborative project involving the National Institute of Standards and Technology (NIST) and SEMATECH.
NIST is a non-regulatory federal agency within the U.S. Department of Commerce. The main role of NIST is to promote U.S. innovation and industrial competitiveness by advancing measurement science, standards, and technology.
SEMATECH was a research consortium composed of semiconductor manufacturers and suppliers.
Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to
- maximize insight into a data set;
- uncover underlying structure;
- extract important variables;
- detect outliers and anomalies;
- test underlying assumptions;
- develop parsimonious models; and
- determine optimal factor settings.
The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:
- Plotting the raw data. (Scatter plots, histograms, probability plots, etc.)
- Plotting simple statistics. (Mean plots, standard deviation plots, box plots, etc.)
- Positioning such plots to maximize our natural pattern-recognition abilities, such as using multiple plots per page. (Subplots, faceting, etc.)
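The three kinds of plots listed above can be sketched with ggplot2 and the penguins data, assuming both packages are installed. The object names are illustrative and the choice of body mass as the variable is arbitrary.

```{r}
library(ggplot2)
library(palmerpenguins)

# Drop rows with a missing body mass so every plot uses the same observations
penguins_complete <- subset(penguins, !is.na(body_mass_g))

# 1. Plot the raw data: a histogram of body mass
p_raw <- ggplot(penguins_complete, aes(x = body_mass_g)) +
  geom_histogram(binwidth = 200)

# 2. Plot simple statistics: box plots show medians, quartiles, and outliers
p_stats <- ggplot(penguins_complete, aes(x = species, y = body_mass_g)) +
  geom_boxplot()

# 3. Position plots side by side: one histogram panel per species
p_facets <- p_raw + facet_wrap(~species)

p_raw
p_stats
p_facets
```

Faceting reuses the same histogram specification for each species, which makes differences in location and spread easier to compare by eye.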
The primary goal of EDA is to maximize the analyst’s insight into a data set and into the underlying structure of a data set, while providing all of the specific items that an analyst would want to extract from a data set, such as:
- a good-fitting, parsimonious model;
- a list of outliers;
- a sense of robustness of conclusions;
- estimates for model parameters;
- uncertainties for those estimates;
- a ranked list of important factors;
- conclusions as to whether individual factors are significant;
- optimal settings.
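EDA techniques are used to test four assumptions that typically underlie measurement processes, namely that data from a process or experiment “behave like”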
- a random drawing;
- from a fixed distribution;
- with the distribution having a fixed location; and
- with the distribution having a fixed variation.
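One common graphical check of these four assumptions is a 4-plot: a run sequence plot, a lag plot, a histogram, and a normal probability plot. The sketch below is a minimal version that uses simulated data in place of real measurements and base R graphics to avoid extra dependencies. The variable names are illustrative.

```{r}
set.seed(1)                       # make the simulated data reproducible
n <- 200
y <- rnorm(n, mean = 10, sd = 2)  # stand-in for measurements from a process

# Lag-1 pairs: each observation against the one before it
y_prev <- y[-n]
y_next <- y[-1]
lag1_cor <- cor(y_prev, y_next)   # near 0 when the draws are random

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels
plot(y, type = "l", xlab = "Index", ylab = "y",
     main = "Run sequence plot")  # drifts suggest shifting location or variation
plot(y_prev, y_next, xlab = "y[i - 1]", ylab = "y[i]",
     main = "Lag plot")           # visible structure suggests non-randomness
hist(y, main = "Histogram")       # shape of the distribution
qqnorm(y, main = "Normal probability plot")
qqline(y)
par(old_par)                      # restore the previous layout
```

In the run sequence plot, a flat band with constant spread is consistent with fixed location and fixed variation. A lag plot with no visible structure is consistent with random drawing.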
Which variables in the Palmer Penguins dataset are categorical?
What do we calculate with distributions?
Create a new Quarto html document and answer questions 1, 2, and 6 in the R for Data Science section 1.5.5 Exercises.
Applied Statistical Techniques